Shantenu Jha and Andre Luckow
The tutorial material is available as iPython notebooks on GitHub:
For the purposes of this tutorial we set up a Hadoop cluster and an iPython Notebook environment on Amazon Web Services (no longer active after the tutorial):
Below is a list of dependencies for installation on other machines:
We recommend using Anaconda.
We begin with an overview of using Hadoop and Spark (a minimal PySpark sketch follows the notebook links below):
Hadoop MapReduce: Link to Notebook
Spark: Link to Notebook
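To give a flavor of the Spark programming model covered in the notebook above, here is a minimal PySpark word-count sketch. The input file data.txt and the local master URL are illustrative placeholders and not part of the tutorial material.

```python
# Minimal PySpark word-count sketch; "data.txt" and the local master URL
# are placeholders, not taken from the tutorial notebooks.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

lines = sc.textFile("data.txt")                      # placeholder input file
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(counts.take(10))                               # show the first ten (word, count) pairs
sc.stop()
```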
The Pilot-Abstraction has been used to execute task-based workloads on distributed resources. A Pilot-Job is a placeholder job that is submitted to the resource management system and serves as a container for a dynamically determined set of compute tasks. The Pilot-Data abstraction extends the Pilot-Abstraction to support the management of data in conjunction with compute tasks.
The Pilot-Abstraction supports heterogeneous infrastructure, including cloud, HPC, and Hadoop resources.
The following example demonstrates how the Pilot-Abstraction is used to manage a set of compute tasks.
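The sketch below is one way such an example can look, assuming the BigJob-based Pilot-API (the pilot module). The coordination URL, resource URL, and working directory are placeholders, and the exact class and method names may differ from the version used in the tutorial notebooks.

```python
# Minimal sketch assuming the BigJob-based Pilot-API ("pilot" module); the
# coordination URL, resource URL, and working directory are placeholders.
from pilot import PilotComputeService, ComputeDataService, State

COORDINATION_URL = "redis://localhost:6379"        # placeholder coordination service

pilot_compute_service = PilotComputeService(coordination_url=COORDINATION_URL)

# Start a Pilot (placeholder job) on the target resource.
pilot_compute_description = {
    "service_url": "fork://localhost",             # local machine; could point to an HPC or Hadoop resource
    "number_of_processes": 2,
    "working_directory": "/tmp/pilot",
}
pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)

# Bind compute tasks (compute units) to the Pilot via the ComputeDataService.
compute_data_service = ComputeDataService()
compute_data_service.add_pilot_compute_service(pilot_compute_service)

for i in range(4):
    compute_unit_description = {
        "executable": "/bin/echo",
        "arguments": ["task", str(i)],
        "number_of_processes": 1,
        "output": "stdout.txt",
        "error": "stderr.txt",
    }
    compute_data_service.submit_compute_unit(compute_unit_description)

# Wait until all tasks have finished, then shut everything down.
compute_data_service.wait()
compute_data_service.cancel()
pilot_compute_service.cancel()
```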
The following pairplots show scatter plots between each pair of the four features. Clusters for the different species are indicated by color.
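The mention of four features and species suggests the Iris dataset; under that assumption, a pairplot like the one described can be produced with seaborn as sketched below.

```python
# Minimal pairplot sketch; assumes the Iris dataset (four numeric features plus
# a "species" label), which is an assumption based on the description above.
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")        # four features + "species" column
sns.pairplot(iris, hue="species")      # pairwise scatter plots, colored by species
plt.show()
```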